Pattern-Driven Data Cleaning

Authors

  • El Kindi Rezig
  • Mourad Ouzzani
  • Walid G. Aref
  • Ahmed K. Elmagarmid
  • Ahmed R. Mahmood
Abstract

Data is inherently dirty, and there has been a sustained effort to develop approaches to clean it. A large class of data repair algorithms relies on data-quality rules and integrity constraints to detect and repair errors in the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short), which specify dependencies among attributes in a relation. In this paper, we address three major challenges in data repairing: (1) Accuracy: Most existing techniques strive to produce repairs that minimize changes to the data. However, this process may produce incorrect combinations of attribute values (or patterns). In this work, we formalize the interaction of FD-induced patterns and select repairs that preserve frequent patterns found in the original data. This has the potential to yield better repair quality in terms of both precision and recall. (2) Interpretability of repairs: Current data repair algorithms produce repairs in the form of data updates that are not necessarily understandable. This makes it hard to debug repair decisions and trace the chain of steps that produced them. To this end, we define a new formalism to declaratively express repairs that are easy for users to reason about. (3) Scalability: We propose a linear-time algorithm to compute repairs that outperforms state-of-the-art FD repairing algorithms by orders of magnitude in repair time. Our experiments using both real-world and synthetic data demonstrate that our new repair approach consistently outperforms existing techniques in terms of both repair quality and scalability.
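
To make the notions of an FD violation and a frequency-preserving repair concrete, the following is a minimal sketch in Python. It is not the algorithm proposed in the paper: it handles a single FD over two attributes and uses a plain majority vote, which is only an assumed stand-in for "preserving frequent patterns". The function names (`fd_violations`, `repair_by_frequent_pattern`) and the sample `zip`/`city` rows are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: Counter of rhs values} for groups that violate lhs -> rhs.

    A group violates the FD when one lhs value co-occurs with more than one
    distinct rhs value.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    return {key: counts for key, counts in groups.items() if len(counts) > 1}


def repair_by_frequent_pattern(rows, lhs, rhs):
    """Illustrative repair (assumption, not the paper's method).

    In every violating group, overwrite the rhs value with the most frequent
    rhs value observed for that lhs value. Input rows are not mutated.
    """
    violations = fd_violations(rows, lhs, rhs)
    repaired = []
    for row in rows:
        new_row = dict(row)
        counts = violations.get(row[lhs])
        if counts is not None:
            # Keep the (lhs, rhs) pattern that occurs most often in the data.
            new_row[rhs] = counts.most_common(1)[0][0]
        repaired.append(new_row)
    return repaired


# Hypothetical sample data for the FD zip -> city.
rows = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
    {"zip": "10001", "city": "New York"},
]

repaired = repair_by_frequent_pattern(rows, "zip", "city")
```

In this sketch the third row is the only one changed, because the pattern (47906, West Lafayette) is more frequent than (47906, Lafayette). A real repair would also have to handle ties, multiple interacting FDs, and errors on the left-hand side, none of which this example attempts.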


Similar resources

Unifacta: Profiling-driven String Pattern Standardization

Data cleaning is critical for effective data analytics on many real-world data collected. One of the most challenging data cleaning tasks is pattern standardization–reformatting ad hoc data, e.g., phone numbers, human names and addresses, in heterogeneous non-standard patterns (formats) into a standard pattern–as it is tedious and effort-consuming, especially for large data sets with diverse pa...


Data Mining for Actionable Knowledge: A Survey

The data mining process consists of a series of steps ranging from data cleaning, data selection and transformation, to pattern evaluation and visualization. One of the central problems in data mining is to make the mined patterns or knowledge actionable. Here, the term actionable means that the mined patterns suggest concrete and profitable actions to the decision-maker. That is, the user can d...


A Novel Technique for Path Completion in Web Usage Mining

The World Wide Web is a huge repository of web pages and links. The Web mining field encompasses a wide array of issues, primarily aimed at deriving actionable knowledge from the Web, and includes researchers from information retrieval, database technologies, and artificial intelligence. The growth of the Web is tremendous, as approximately one million pages are added daily. Users’ accesses are recorded...


Query-Driven Approach to Entity Resolution

This paper explores “on-the-fly” data cleaning in the context of a user query. A novel Query-Driven Approach (QDA) is developed that performs a minimal number of cleaning steps that are only necessary to answer a given selection query correctly. The comprehensive empirical evaluation of the proposed approach demonstrates its significant advantage in terms of efficiency over traditional techniqu...


Private Exploration Primitives for Data Cleaning

Data cleaning is the process of detecting and repairing inaccurate or corrupt records in the data. Data cleaning is inherently human-driven and state of the art systems assume cleaning experts can access the data to tune the cleaning process. However, in sensitive datasets, like electronic medical records, privacy constraints disallow unfettered access to the data. To address this challenge, we...


Swarm Control of UAVs for Cooperative Hunting with DDDAS

Swarm control is a problem of increasing importance with technological advancements. Recently, governments have begun employing UAVs for reconnaissance, including swarms of drones searching for evasive targets. An agent-based simulation for dynamic cooperative cleaning is augmented with additional behaviors and implemented into a Dynamic Data-Driven Application System (DDDAS) framework for dyna...



Journal:
  • CoRR

Volume: abs/1712.09437  Issue:

Pages: -

Publication date: 2017